Unsupervised Learning - Clustering
Introduction
Importing libraries
In [ ]:
%pip install ucimlrepo
%pip install ydata-profiling
%pip install plotly
In [ ]:
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from ydata_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn import metrics
In [ ]:
random = 42  # fixed random seed (note: this name shadows the stdlib `random` module)
generate_report = False
Loading the dataset
In [ ]:
# fetch dataset
abalone = fetch_ucirepo(id=1)
# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets
# metadata
print(abalone.metadata)
# variable information
print(abalone.variables)
{'uci_id': 1, 'name': 'Abalone', 'repository_url': 'https://archive.ics.uci.edu/dataset/1/abalone', 'data_url': 'https://archive.ics.uci.edu/static/public/1/data.csv', 'abstract': 'Predict the age of abalone from physical measurements', 'area': 'Biology', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Tabular'], 'num_instances': 4177, 'num_features': 8, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['Rings'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C55C7W', 'creators': ['Warwick Nash', 'Tracy Sellers', 'Simon Talbot', 'Andrew Cawthorn', 'Wes Ford'], 'intro_paper': None, 'additional_info': {'summary': 'Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.\r\n\r\nFrom the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).', 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Given is the attribute name, attribute type, the measurement unit and a brief description. 
The number of rings is the value to predict: either as a continuous value or as a classification problem.\r\n\r\nName / Data Type / Measurement Unit / Description\r\n-----------------------------\r\nSex / nominal / -- / M, F, and I (infant)\r\nLength / continuous / mm / Longest shell measurement\r\nDiameter\t/ continuous / mm / perpendicular to length\r\nHeight / continuous / mm / with meat in shell\r\nWhole weight / continuous / grams / whole abalone\r\nShucked weight / continuous\t / grams / weight of meat\r\nViscera weight / continuous / grams / gut weight (after bleeding)\r\nShell weight / continuous / grams / after being dried\r\nRings / integer / -- / +1.5 gives the age in years\r\n\r\nThe readme file contains attribute statistics.', 'citation': None}}
| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Sex | Feature | Categorical | None | M, F, and I (infant) | None | no |
| 1 | Length | Feature | Continuous | None | Longest shell measurement | mm | no |
| 2 | Diameter | Feature | Continuous | None | perpendicular to length | mm | no |
| 3 | Height | Feature | Continuous | None | with meat in shell | mm | no |
| 4 | Whole_weight | Feature | Continuous | None | whole abalone | grams | no |
| 5 | Shucked_weight | Feature | Continuous | None | weight of meat | grams | no |
| 6 | Viscera_weight | Feature | Continuous | None | gut weight (after bleeding) | grams | no |
| 7 | Shell_weight | Feature | Continuous | None | after being dried | grams | no |
| 8 | Rings | Target | Integer | None | +1.5 gives the age in years | None | no |
Train/test split
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random)
In [ ]:
df_to_profile = pd.concat([X_train, y_train], axis=1)
In [ ]:
if generate_report:
    # Generate the profile report
    profile = ProfileReport(df_to_profile)
    # Display the profile report
    profile.to_widgets()
AutoEDA conclusions
- The dataset has no missing values.
- It has a single categorical column: the sex of the abalone (a family of marine snails), with a roughly uniform distribution.
- All remaining columns are numeric.
- Height contains outliers.
- Height and Rings have symmetric, roughly normal distributions, while the rest are skewed.
- Since the dataset was designed for regression on the number of rings, the other features show a positive linear correlation with Rings.
Basic Analysis
In [ ]:
df_to_profile.head()
Out[ ]:
| Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | |
|---|---|---|---|---|---|---|---|---|---|
| 4038 | I | 0.550 | 0.445 | 0.125 | 0.6720 | 0.2880 | 0.1365 | 0.210 | 11 |
| 1272 | I | 0.475 | 0.355 | 0.100 | 0.5035 | 0.2535 | 0.0910 | 0.140 | 8 |
| 3384 | F | 0.305 | 0.225 | 0.070 | 0.1485 | 0.0585 | 0.0335 | 0.045 | 7 |
| 3160 | I | 0.275 | 0.200 | 0.065 | 0.1165 | 0.0565 | 0.0130 | 0.035 | 7 |
| 3894 | M | 0.495 | 0.380 | 0.135 | 0.6295 | 0.2630 | 0.1425 | 0.215 | 12 |
In [ ]:
# describe the data
df_to_profile.describe()
Out[ ]:
| Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | |
|---|---|---|---|---|---|---|---|---|
| count | 3341.000000 | 3341.000000 | 3341.000000 | 3341.000000 | 3341.000000 | 3341.000000 | 3341.000000 | 3341.000000 |
| mean | 0.524964 | 0.408518 | 0.139790 | 0.830838 | 0.360561 | 0.180832 | 0.239682 | 9.944627 |
| std | 0.119137 | 0.098687 | 0.042514 | 0.491583 | 0.223018 | 0.109444 | 0.139941 | 3.207344 |
| min | 0.075000 | 0.055000 | 0.000000 | 0.002000 | 0.001000 | 0.000500 | 0.001500 | 1.000000 |
| 25% | 0.450000 | 0.350000 | 0.115000 | 0.443000 | 0.186500 | 0.093000 | 0.130000 | 8.000000 |
| 50% | 0.545000 | 0.425000 | 0.140000 | 0.802000 | 0.337000 | 0.171000 | 0.234000 | 9.000000 |
| 75% | 0.615000 | 0.480000 | 0.165000 | 1.151000 | 0.503500 | 0.253500 | 0.328500 | 11.000000 |
| max | 0.815000 | 0.650000 | 1.130000 | 2.825500 | 1.488000 | 0.760000 | 1.005000 | 29.000000 |
In [ ]:
# correlation heatmap over all numerical columns
%matplotlib inline
numerical_columns = df_to_profile.select_dtypes(include=[np.number]).columns.tolist()
corr = df_to_profile[numerical_columns].corr()
plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
- Many features are strongly positively correlated with each other, meaning they are redundant and could be replaced by a single feature.
Data preprocessing
Normalization
In [ ]:
from sklearn.preprocessing import MinMaxScaler
X_train_n = X_train.copy()
X_test_n = X_test.copy()
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
scaler = MinMaxScaler()
df_normalized_train = scaler.fit_transform(X_train[numeric_cols])
X_train_n[numeric_cols] = df_normalized_train
df_normalized_test = scaler.transform(X_test[numeric_cols])
X_test_n[numeric_cols] = df_normalized_test
scaler_y = MinMaxScaler()
y_train_n = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
y_test_n = scaler_y.transform(y_test.values.reshape(-1, 1))
Encoding categorical variables
In [ ]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X_train_n["Sex"] = le.fit_transform(X_train_n["Sex"])
X_test_n["Sex"] = le.transform(X_test_n["Sex"])
In [ ]:
X_train_n.head()
Out[ ]:
| Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | |
|---|---|---|---|---|---|---|---|---|
| 4038 | 1 | 0.641892 | 0.655462 | 0.110619 | 0.237294 | 0.193006 | 0.179065 | 0.207773 |
| 1272 | 1 | 0.540541 | 0.504202 | 0.088496 | 0.177616 | 0.169805 | 0.119157 | 0.138017 |
| 3384 | 0 | 0.310811 | 0.285714 | 0.061947 | 0.051886 | 0.038668 | 0.043450 | 0.043348 |
| 3160 | 1 | 0.270270 | 0.243697 | 0.057522 | 0.040553 | 0.037323 | 0.016458 | 0.033383 |
| 3894 | 2 | 0.567568 | 0.546218 | 0.119469 | 0.222242 | 0.176194 | 0.186965 | 0.212755 |
In [ ]:
X_test_n.head()
Out[ ]:
| Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | |
|---|---|---|---|---|---|---|---|---|
| 866 | 2 | 0.716216 | 0.672269 | 0.141593 | 0.390119 | 0.282448 | 0.396313 | 0.322372 |
| 1483 | 2 | 0.695946 | 0.647059 | 0.132743 | 0.308305 | 0.259583 | 0.282423 | 0.242651 |
| 599 | 0 | 0.655405 | 0.655462 | 0.172566 | 0.346733 | 0.204438 | 0.294931 | 0.332337 |
| 1702 | 0 | 0.756757 | 0.731092 | 0.150442 | 0.446078 | 0.361466 | 0.350230 | 0.377180 |
| 670 | 2 | 0.540541 | 0.554622 | 0.128319 | 0.217992 | 0.157364 | 0.141540 | 0.212755 |
PCA and dimensionality
In [ ]:
def plot_variance_explained(df):
    pca = PCA()
    pca.fit(df)
    explained_variance = pca.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)
    fig, axs = plt.subplots(2, 1, figsize=(10, 8))
    axs[0].plot(range(1, len(explained_variance) + 1), explained_variance, marker='o')
    axs[0].set_xlabel('Principal Components')
    axs[0].set_ylabel('Explained Variance')
    axs[0].set_title('Explained Variance')
    axs[0].grid(True)
    axs[1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
    axs[1].set_xlabel('Principal Components')
    axs[1].set_ylabel('Cumulative Variance Explained')
    axs[1].set_title('Cumulative Variance Explained')
    axs[1].grid(True)
    plt.tight_layout()
    plt.show()
In [ ]:
plot_variance_explained(X_train_n)
KMeans
How the algorithm works
- Choose the number of clusters K.
- Initialize K points as centroids:
- randomly, or
- with k-means++ (usually better results).
- Assign each point to its nearest centroid.
- Update each centroid as the mean of all points assigned to its cluster.
- Repeat steps 3 and 4 until convergence.
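The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of Lloyd's iterations only, not a replacement for scikit-learn's `KMeans` (which adds k-means++ initialization, multiple restarts, and tolerance handling); the optional `init` parameter is an addition here, used just to keep the demo deterministic:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, init=None, seed: int = 42):
    rng = np.random.default_rng(seed)
    if init is None:
        # step 2: pick k distinct data points as initial centroids (random init)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # step 5: convergence
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated groups; fixed init makes the result reproducible
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, cents = kmeans(X, k=2, init=X[[0, 3]])
print(labels)  # → [0 0 0 1 1 1]
```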
How to choose the number of clusters in KMeans?
- Elbow method: plot inertia (the sum of squared distances of samples to their nearest centroid) against the number of clusters. The optimal number of clusters is at the "elbow" (inflection point) of the curve.
In [ ]:
def plot_elbow(df):
    inertia = []
    k_values = range(1, 20)
    for k in k_values:
        kmeans = KMeans(n_clusters=k, random_state=random)
        kmeans.fit(df)
        inertia.append(kmeans.inertia_)
    plt.plot(k_values, inertia, marker='o')
    plt.xticks(k_values)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Inertia')
    plt.title('Elbow Plot')
    plt.grid(True)
    plt.show()
In [ ]:
plot_elbow(X_train_n)
- Other metrics can also be used, such as the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Score.
Silhouette Score
The Silhouette Score is a metric used to assess the quality of a clustering assignment.
Its value lies in the range from -1 to 1, where:
- values close to 1 mean observations are well matched to their own cluster and poorly matched to other clusters
- values close to 0 mean observations lie on or near the boundary between clusters
- values close to -1 mean observations are likely assigned to the wrong cluster
The measure is built from:
- a: the mean distance between a point and all other points in its own cluster
- b: the mean distance between a point and the points of the nearest other cluster. The Silhouette Score formula: $$s = \frac{b - a}{\max(a, b)}$$
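On a tiny hand-checkable example the formula can be verified against scikit-learn (the data here is illustrative, not from the abalone set):

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

# two tight 1-D groups far apart
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the point at 0.0: a = mean intra-cluster distance = 1.0,
# b = mean distance to the nearest other cluster = (10 + 11) / 2 = 10.5,
# so s = (b - a) / max(a, b) = 9.5 / 10.5 ≈ 0.905.
print(silhouette_samples(X, labels)[0])  # ≈ 0.9048
print(silhouette_score(X, labels))       # mean of the per-sample values
```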
Calinski-Harabasz Score
- The Calinski-Harabasz Score is an internal metric that measures cluster compactness and the separation between clusters.
- The higher the Calinski-Harabasz Score, the better the clustering.
- The metric computes the ratio of between-cluster variance to within-cluster variance.
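The same toy example makes the variance-ratio definition concrete. This is a sketch of the computation; scikit-learn's `calinski_harabasz_score` is the reference implementation:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# CH = (between-cluster dispersion / (k-1)) / (within-cluster dispersion / (n-k))
n, k = len(X), 2
overall_mean = X.mean(axis=0)                                           # 5.5
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])   # 0.5, 10.5
between = sum((labels == j).sum() * np.sum((c - overall_mean) ** 2)
              for j, c in enumerate(centroids))                         # 2*25 + 2*25 = 100
within = sum(np.sum((X[labels == j] - centroids[j]) ** 2)
             for j in range(k))                                         # 0.5 + 0.5 = 1
ch_manual = (between / (k - 1)) / (within / (n - k))                    # 100 / 0.5 = 200
print(ch_manual, calinski_harabasz_score(X, labels))  # → 200.0 200.0
```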
Davies-Bouldin Score
- The Davies-Bouldin Score is also an internal metric that measures clustering quality.
- The lower the Davies-Bouldin Score, the better the separation between clusters and the more homogeneous the clusters.
- The DB Score averages, over cluster pairs, the within-cluster distances divided by the distance between the cluster centroids.
- Computationally, DB is cheaper than the Silhouette Score.
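The Davies-Bouldin value for the same toy data, computed from its definition (a sketch; `davies_bouldin_score` is the reference implementation):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# s_j: mean distance of cluster points to their centroid; d: centroid distance
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])  # 0.5, 10.5
s = [np.mean(np.linalg.norm(X[labels == j] - centroids[j], axis=1))
     for j in range(2)]                                                # 0.5, 0.5
d = np.linalg.norm(centroids[0] - centroids[1])                        # 10.0
# with k = 2, both clusters share the same similarity ratio (s_i + s_j) / d
db_manual = (s[0] + s[1]) / d
print(db_manual, davies_bouldin_score(X, labels))  # → 0.1 0.1
```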
In [ ]:
def plot_metrics_kmeans(df: pd.DataFrame):
    silhouette_scores_KMEANS = []
    calinski_harabasz_scores_KMEANS = []
    davies_bouldin_scores_KMEANS = []
    for i in range(2, 11):
        kmeans = KMeans(n_clusters=i, random_state=random)
        kmeans.fit(df)
        y_kmeans = kmeans.predict(df)
        silhouette_scores_KMEANS.append(silhouette_score(df, y_kmeans))
        calinski_harabasz_scores_KMEANS.append(calinski_harabasz_score(df, y_kmeans))
        davies_bouldin_scores_KMEANS.append(davies_bouldin_score(df, y_kmeans))
    fig, axs = plt.subplots(1, 3, figsize=(15, 5))
    axs[0].plot(range(2, 11), silhouette_scores_KMEANS, marker="o", color="b")
    axs[0].set_title("Silhouette Score")
    axs[0].set_xlabel("Number of clusters")
    best_k = silhouette_scores_KMEANS.index(max(silhouette_scores_KMEANS)) + 2
    axs[0].axvline(x=best_k, color="r", linestyle="-.")
    axs[0].axhline(y=max(silhouette_scores_KMEANS), color="g", linestyle="--")
    axs[0].set_ylabel("Score")
    axs[0].grid(True)
    axs[1].plot(range(2, 11), calinski_harabasz_scores_KMEANS, marker="o", color="r")
    best_k = (
        calinski_harabasz_scores_KMEANS.index(max(calinski_harabasz_scores_KMEANS)) + 2
    )
    axs[1].axvline(x=best_k, color="r", linestyle="-.")
    axs[1].axhline(y=max(calinski_harabasz_scores_KMEANS), color="g", linestyle="--")
    axs[1].set_title("Calinski Harabasz Score")
    axs[1].set_xlabel("Number of clusters")
    axs[1].set_ylabel("Score")
    axs[1].grid(True)
    best_k = davies_bouldin_scores_KMEANS.index(min(davies_bouldin_scores_KMEANS)) + 2
    axs[2].plot(range(2, 11), davies_bouldin_scores_KMEANS, marker="o", color="g")
    axs[2].axvline(x=best_k, color="r", linestyle="-.")
    axs[2].axhline(y=min(davies_bouldin_scores_KMEANS), color="g", linestyle="--")
    axs[2].set_title("Davies Bouldin Score")
    axs[2].set_xlabel("Number of clusters")
    axs[2].set_ylabel("Score")
    axs[2].grid(True)
    plt.tight_layout()
    plt.show()
In [ ]:
plot_metrics_kmeans(X_train_n)
Visualizing the clustering: two raw features vs. dimensionality reduction
In [ ]:
kmeans = KMeans(n_clusters=3, random_state=random)
kmeans.fit(X_train_n)
y_kmeans = kmeans.predict(X_train_n)
df_clustered = X_train_n.copy()
df_clustered['Cluster'] = y_kmeans
feat1 = "Viscera_weight"
feat2 = "Shell_weight"
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_clustered, x=feat1, y=feat2, hue='Cluster', palette='viridis', alpha=0.7)
plt.title(f'{feat1} vs {feat2} with KMeans Clustering')
plt.show()
# Plot clusters using PCA
pca = PCA(n_components=2)
pca.fit(X_train_n)
df_pca = pca.transform(X_train_n)
df_pca = pd.DataFrame(df_pca, columns=['PC1', 'PC2'])
df_pca['Cluster'] = y_kmeans
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_pca, x='PC1', y='PC2', hue=df_pca['Cluster'], palette='viridis', alpha=0.7)
plt.title('PCA with KMeans Clustering')
plt.show()
All feature pairs
In [ ]:
sns.pairplot(df_clustered, hue='Cluster', palette='viridis')
plt.title('Pairplot with KMeans Clustering')
plt.show()
Clustering after dimensionality reduction
In [ ]:
# use PCA for dimensionality reduction
pca = PCA(n_components=3, random_state=random)
principalComponents = pca.fit_transform(X_train_n)
display(X_train_n.head())
# Create a DataFrame with the principal components
principalDf3 = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2', 'PC3'])
# KMeans clustering
kmeans_pca = KMeans(n_clusters=3, random_state=random)
kmeans_pca.fit(principalDf3)
y_kmeans_pca = kmeans_pca.predict(principalDf3)
# Create a DataFrame with the cluster labels
df_clustered_pca = principalDf3.copy()
df_clustered_pca['Cluster'] = y_kmeans_pca
# Plot clusters using PCA also add centroids
fig = px.scatter_3d(df_clustered_pca, x='PC1', y='PC2', z='PC3', color='Cluster', opacity=0.5, width=800, height=800)
fig.show()
| Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | |
|---|---|---|---|---|---|---|---|---|
| 4038 | 1 | 0.641892 | 0.655462 | 0.110619 | 0.237294 | 0.193006 | 0.179065 | 0.207773 |
| 1272 | 1 | 0.540541 | 0.504202 | 0.088496 | 0.177616 | 0.169805 | 0.119157 | 0.138017 |
| 3384 | 0 | 0.310811 | 0.285714 | 0.061947 | 0.051886 | 0.038668 | 0.043450 | 0.043348 |
| 3160 | 1 | 0.270270 | 0.243697 | 0.057522 | 0.040553 | 0.037323 | 0.016458 | 0.033383 |
| 3894 | 2 | 0.567568 | 0.546218 | 0.119469 | 0.222242 | 0.176194 | 0.186965 | 0.212755 |
Influence of the target feature on clustering
- Check whether the target feature affects the clustering.
- It turns out the separation driven by the categorical feature is pronounced enough that adding the target has no visible effect on the clustering.
In [ ]:
XY_train_n = X_train_n.copy()
XY_train_n['Rings'] = y_train_n
XY_test_n = X_test_n.copy()
XY_test_n['Rings'] = y_test_n
In [ ]:
pca = PCA(n_components=3, random_state=random)
principalComponents = pca.fit_transform(XY_train_n)
display(XY_train_n.head())
principalDf3 = pd.DataFrame(data=principalComponents, columns=["PC1", "PC2", "PC3"])
# KMeans clustering
kmeans_pca = KMeans(n_clusters=3, random_state=random)
kmeans_pca.fit(principalDf3)
y_kmeans_pca = kmeans_pca.predict(principalDf3)
df_clustered_pca = principalDf3.copy()
df_clustered_pca["Cluster"] = y_kmeans_pca
fig = px.scatter_3d(
df_clustered_pca,
x="PC1",
y="PC2",
z="PC3",
color="Cluster",
opacity=0.5,
width=800,
height=800,
)
fig.show()
| Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | |
|---|---|---|---|---|---|---|---|---|---|
| 4038 | 1 | 0.641892 | 0.655462 | 0.110619 | 0.237294 | 0.193006 | 0.179065 | 0.207773 | 0.357143 |
| 1272 | 1 | 0.540541 | 0.504202 | 0.088496 | 0.177616 | 0.169805 | 0.119157 | 0.138017 | 0.250000 |
| 3384 | 0 | 0.310811 | 0.285714 | 0.061947 | 0.051886 | 0.038668 | 0.043450 | 0.043348 | 0.214286 |
| 3160 | 1 | 0.270270 | 0.243697 | 0.057522 | 0.040553 | 0.037323 | 0.016458 | 0.033383 | 0.214286 |
| 3894 | 2 | 0.567568 | 0.546218 | 0.119469 | 0.222242 | 0.176194 | 0.186965 | 0.212755 | 0.392857 |
Analysis of the methods' hyperparameters, with appropriate data visualizations
Number of clusters
In [ ]:
import matplotlib.animation as animation
from mpl_toolkits.mplot3d import Axes3D

def animate_hiperparameters(dataframe, test):
    # Perform PCA to reduce to 3 dimensions for visualization
    pca = PCA(n_components=3)
    principalDf3 = pd.DataFrame(
        pca.fit_transform(dataframe), columns=["PC1", "PC2", "PC3"]
    )
    pca_test = pd.DataFrame(
        pca.transform(test), columns=["PC1", "PC2", "PC3"]
    )
    # Initialize lists to store metrics
    silhouette_scores_KMEANS = []
    calinski_harabasz_scores_KMEANS = []
    davies_bouldin_scores_KMEANS = []
    silhouette_scores_KMEANS_test = []
    calinski_harabasz_scores_KMEANS_test = []
    davies_bouldin_scores_KMEANS_test = []
    for i in range(2, 11):
        kmeans = KMeans(n_clusters=i, random_state=42)
        kmeans.fit(dataframe)
        y_kmeans = kmeans.predict(dataframe)
        y_test_kmeans = kmeans.predict(test)
        silhouette_scores_KMEANS.append(silhouette_score(dataframe, y_kmeans))
        calinski_harabasz_scores_KMEANS.append(calinski_harabasz_score(dataframe, y_kmeans))
        davies_bouldin_scores_KMEANS.append(davies_bouldin_score(dataframe, y_kmeans))
        silhouette_scores_KMEANS_test.append(silhouette_score(test, y_test_kmeans))
        calinski_harabasz_scores_KMEANS_test.append(calinski_harabasz_score(test, y_test_kmeans))
        davies_bouldin_scores_KMEANS_test.append(davies_bouldin_score(test, y_test_kmeans))
    fig = plt.figure(figsize=(20, 10))
    ax1 = fig.add_subplot(231, projection="3d")
    ax1.set_xlabel("PC1")
    ax1.set_ylabel("PC2")
    ax1.set_zlabel("PC3")
    ax1.set_title("Train Clustering with PCA")
    ax2 = fig.add_subplot(234, projection="3d")
    ax2.set_xlabel("PC1")
    ax2.set_ylabel("PC2")
    ax2.set_zlabel("PC3")
    ax2.set_title("Test Clustering with PCA")
    ax3 = fig.add_subplot(232)
    ax4 = fig.add_subplot(233)
    ax5 = fig.add_subplot(235)

    def update(num):
        ax1.cla()
        ax1.set_xlabel("PC1")
        ax1.set_ylabel("PC2")
        ax1.set_zlabel("PC3")
        ax1.set_title(f"Train Clustering with PCA - {num} clusters")
        ax2.cla()
        ax2.set_xlabel("PC1")
        ax2.set_ylabel("PC2")
        ax2.set_zlabel("PC3")
        ax2.set_title(f"Test Clustering with PCA - {num} clusters")
        kmeans_pca = KMeans(n_clusters=num, random_state=42)
        kmeans_pca.fit(principalDf3)
        y_kmeans_pca = kmeans_pca.predict(principalDf3)
        y_kmeans_pca_test = kmeans_pca.predict(pca_test)
        df_clustered_pca = principalDf3.copy()
        df_clustered_pca["Cluster"] = y_kmeans_pca
        df_clustered_pca_test = pca_test.copy()
        df_clustered_pca_test["Cluster"] = y_kmeans_pca_test
        scatter1 = ax1.scatter(
            df_clustered_pca["PC1"],
            df_clustered_pca["PC2"],
            df_clustered_pca["PC3"],
            c=df_clustered_pca["Cluster"],
            cmap="viridis",
            alpha=0.7,
        )
        legend1 = ax1.legend(*scatter1.legend_elements(), title="Clusters")
        ax1.add_artist(legend1)
        scatter2 = ax2.scatter(
            df_clustered_pca_test["PC1"],
            df_clustered_pca_test["PC2"],
            df_clustered_pca_test["PC3"],
            c=df_clustered_pca_test["Cluster"],
            cmap="viridis",
            alpha=0.7,
        )
        legend2 = ax2.legend(*scatter2.legend_elements(), title="Clusters")
        ax2.add_artist(legend2)
        ax3.cla()
        ax4.cla()
        ax5.cla()
        ax3.plot(range(2, 11), silhouette_scores_KMEANS, marker="o", color="b", label="Train")
        ax3.plot(range(2, 11), silhouette_scores_KMEANS_test, marker="o", color="k", label="Test")
        ax3.set_title("Silhouette Score")
        ax3.set_xlabel("Number of clusters")
        ax3.set_ylabel("Score")
        ax3.grid(True)
        ax3.axvline(x=num, color="r", linestyle="-.")
        ax3.legend()
        ax4.plot(range(2, 11), calinski_harabasz_scores_KMEANS, marker="o", color="r", label="Train")
        ax4.plot(range(2, 11), calinski_harabasz_scores_KMEANS_test, marker="o", color="k", label="Test")
        ax4.set_title("Calinski Harabasz Score")
        ax4.set_xlabel("Number of clusters")
        ax4.set_ylabel("Score")
        ax4.grid(True)
        ax4.axvline(x=num, color="r", linestyle="-.")
        ax4.legend()
        ax5.plot(range(2, 11), davies_bouldin_scores_KMEANS, marker="o", color="g", label="Train")
        ax5.plot(range(2, 11), davies_bouldin_scores_KMEANS_test, marker="o", color="k", label="Test")
        ax5.set_title("Davies Bouldin Score")
        ax5.set_xlabel("Number of clusters")
        ax5.set_ylabel("Score")
        ax5.grid(True)
        ax5.axvline(x=num, color="r", linestyle="-.")
        ax5.legend()

    ani = animation.FuncAnimation(fig, update, frames=range(2, 11), repeat=True)
    ani.save("kmeans_pca.gif", writer="imagemagick", fps=1)
anim = animate_hiperparameters(XY_train_n, XY_test_n)
from IPython.display import Image
Image(filename="kmeans_pca.gif")
MovieWriter imagemagick unavailable; using Pillow instead.
Out[ ]:
<IPython.core.display.Image object>
Init - centroid initialization method¶
In [ ]:
def test_kmeans_init_hyperparameter(dataframe, test):
# Different init methods to test
init_methods = ["k-means++", "random"]
# Initialize lists to store metrics for different init methods
silhouette_scores = {method: [] for method in init_methods}
calinski_harabasz_scores = {method: [] for method in init_methods}
davies_bouldin_scores = {method: [] for method in init_methods}
silhouette_scores_test = {method: [] for method in init_methods}
calinski_harabasz_scores_test = {method: [] for method in init_methods}
davies_bouldin_scores_test = {method: [] for method in init_methods}
# Calculate metrics for different init methods and number of clusters
for method in init_methods:
for n_clusters in range(2, 11):
kmeans = KMeans(n_clusters=n_clusters, init=method, random_state=42)
kmeans.fit(dataframe)
y_kmeans = kmeans.predict(dataframe)
y_test_kmeans = kmeans.predict(test)
silhouette_scores[method].append(silhouette_score(dataframe, y_kmeans))
calinski_harabasz_scores[method].append(
calinski_harabasz_score(dataframe, y_kmeans)
)
davies_bouldin_scores[method].append(
davies_bouldin_score(dataframe, y_kmeans)
)
silhouette_scores_test[method].append(silhouette_score(test, y_test_kmeans))
calinski_harabasz_scores_test[method].append(
calinski_harabasz_score(test, y_test_kmeans)
)
davies_bouldin_scores_test[method].append(
davies_bouldin_score(test, y_test_kmeans)
)
fig, axs = plt.subplots(3, 1, figsize=(10, 15))
for method in init_methods:
axs[0].plot(range(2, 11), silhouette_scores[method], marker="o", label=method)
axs[0].plot(range(2, 11), silhouette_scores_test[method], marker="o", label=method+" test")
axs[1].plot(
range(2, 11), calinski_harabasz_scores[method], marker="o", label=method
)
axs[1].plot(
range(2, 11), calinski_harabasz_scores_test[method], marker="o", label=method+" test"
)
axs[2].plot(
range(2, 11), davies_bouldin_scores[method], marker="o", label=method
)
axs[2].plot(
range(2, 11), davies_bouldin_scores_test[method], marker="o", label=method+" test"
)
axs[0].set_title("Silhouette Score")
axs[0].set_xlabel("Number of clusters")
axs[0].set_ylabel("Score")
axs[0].legend()
axs[0].grid(True)
axs[1].set_title("Calinski Harabasz Score")
axs[1].set_xlabel("Number of clusters")
axs[1].set_ylabel("Score")
axs[1].legend()
axs[1].grid(True)
axs[2].set_title("Davies Bouldin Score")
axs[2].set_xlabel("Number of clusters")
axs[2].set_ylabel("Score")
axs[2].legend()
axs[2].grid(True)
plt.tight_layout()
plt.show()
test_kmeans_init_hyperparameter(XY_train_n, XY_test_n)
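The difference between the two strategies can also be seen directly on synthetic data: with a single initialization (`n_init=1`), k-means++ tends to start closer to the final centroids than random init. This is only a sketch on `make_blobs` data, not the notebook's `XY_train_n`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for method in ["k-means++", "random"]:
    # n_init=1 so the single chosen initialization is what we compare
    km = KMeans(n_clusters=4, init=method, n_init=1, random_state=42).fit(X)
    print(f"{method:10s} inertia={km.inertia_:.1f}  iterations={km.n_iter_}")
```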
Investigating the max_iter parameter¶
In [ ]:
def test_kmeans_max_iter_hyperparameter(dataframe, test):
max_iter_values = [1, 2, 4, 8, 16]
silhouette_scores = {max_iter: [] for max_iter in max_iter_values}
calinski_harabasz_scores = {max_iter: [] for max_iter in max_iter_values}
davies_bouldin_scores = {max_iter: [] for max_iter in max_iter_values}
silhouette_scores_test = {max_iter: [] for max_iter in max_iter_values}
calinski_harabasz_scores_test = {max_iter: [] for max_iter in max_iter_values}
davies_bouldin_scores_test = {max_iter: [] for max_iter in max_iter_values}
for max_iter in max_iter_values:
for n_clusters in range(2, 11):
kmeans = KMeans(n_clusters=n_clusters, max_iter=max_iter, random_state=random)
kmeans.fit(dataframe)
y_kmeans = kmeans.predict(dataframe)
y_test_kmeans = kmeans.predict(test)
silhouette_scores[max_iter].append(silhouette_score(dataframe, y_kmeans))
calinski_harabasz_scores[max_iter].append(
calinski_harabasz_score(dataframe, y_kmeans)
)
davies_bouldin_scores[max_iter].append(
davies_bouldin_score(dataframe, y_kmeans)
)
silhouette_scores_test[max_iter].append(silhouette_score(test, y_test_kmeans))
calinski_harabasz_scores_test[max_iter].append(
calinski_harabasz_score(test, y_test_kmeans)
)
davies_bouldin_scores_test[max_iter].append(
davies_bouldin_score(test, y_test_kmeans)
)
fig, axs = plt.subplots(3, 2, figsize=(10, 15))
for max_iter in max_iter_values:
axs[0][0].plot(
range(2, 11),
silhouette_scores[max_iter],
marker="o",
label=f"max_iter={max_iter}",
)
axs[1][0].plot(
range(2, 11),
calinski_harabasz_scores[max_iter],
marker="o",
label=f"max_iter={max_iter}",
)
axs[2][0].plot(
range(2, 11),
davies_bouldin_scores[max_iter],
marker="o",
label=f"max_iter={max_iter}",
)
for max_iter in max_iter_values:
axs[0][1].plot(
range(2, 11),
silhouette_scores_test[max_iter],
marker="o",
label=f"max_iter={max_iter} test",
)
axs[1][1].plot(
range(2, 11),
calinski_harabasz_scores_test[max_iter],
marker="o",
label=f"max_iter={max_iter} test",
)
axs[2][1].plot(
range(2, 11),
davies_bouldin_scores_test[max_iter],
marker="o",
label=f"max_iter={max_iter} test",
)
axs[0][0].set_title("Silhouette Score")
axs[0][0].set_xlabel("Number of clusters")
axs[0][0].set_ylabel("Score")
axs[0][0].legend()
axs[0][0].grid(True)
axs[1][0].set_title("Calinski Harabasz Score")
axs[1][0].set_xlabel("Number of clusters")
axs[1][0].set_ylabel("Score")
axs[1][0].legend()
axs[1][0].grid(True)
axs[2][0].set_title("Davies Bouldin Score")
axs[2][0].set_xlabel("Number of clusters")
axs[2][0].set_ylabel("Score")
axs[2][0].legend()
axs[2][0].grid(True)
axs[0][1].set_title("Silhouette Score Test")
axs[0][1].set_xlabel("Number of clusters")
axs[0][1].set_ylabel("Score")
axs[0][1].legend()
axs[0][1].grid(True)
axs[1][1].set_title("Calinski Harabasz Score Test")
axs[1][1].set_xlabel("Number of clusters")
axs[1][1].set_ylabel("Score")
axs[1][1].legend()
axs[1][1].grid(True)
axs[2][1].set_title("Davies Bouldin Score Test")
axs[2][1].set_xlabel("Number of clusters")
axs[2][1].set_ylabel("Score")
axs[2][1].legend()
axs[2][1].grid(True)
plt.tight_layout()
plt.show()
# Call the function with the sample dataframe
test_kmeans_max_iter_hyperparameter(XY_train_n, XY_test_n)
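To see why the curves above barely change beyond a few iterations, it helps to inspect the fitted model's `n_iter_` attribute: KMeans stops as soon as the centroid shift falls below `tol`, so `max_iter` only matters while it is smaller than the number of iterations actually needed. A sketch on synthetic `make_blobs` data (not `XY_train_n`); note that very small `max_iter` values may emit a `ConvergenceWarning`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for max_iter in [1, 2, 4, 8, 16]:
    km = KMeans(n_clusters=5, max_iter=max_iter, n_init=1, random_state=0).fit(X)
    # n_iter_ is the number of iterations actually run before stopping
    print(f"max_iter={max_iter:2d} -> ran {km.n_iter_} iteration(s), inertia={km.inertia_:.1f}")
```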
Tolerance¶
In [ ]:
def test_kmeans_tol_hyperparameter(dataframe, test):
tol_values = [1e-4, 1e-2, 1]
silhouette_scores = {tol: [] for tol in tol_values}
calinski_harabasz_scores = {tol: [] for tol in tol_values}
davies_bouldin_scores = {tol: [] for tol in tol_values}
silhouette_scores_test = {tol: [] for tol in tol_values}
calinski_harabasz_scores_test = {tol: [] for tol in tol_values}
davies_bouldin_scores_test = {tol: [] for tol in tol_values}
for tol in tol_values:
for n_clusters in range(2, 11):
kmeans = KMeans(n_clusters=n_clusters, tol=tol, random_state=42)
kmeans.fit(dataframe)
y_kmeans = kmeans.predict(dataframe)
y_test_kmeans = kmeans.predict(test)
silhouette_scores[tol].append(silhouette_score(dataframe, y_kmeans))
calinski_harabasz_scores[tol].append(
calinski_harabasz_score(dataframe, y_kmeans)
)
davies_bouldin_scores[tol].append(davies_bouldin_score(dataframe, y_kmeans))
silhouette_scores_test[tol].append(silhouette_score(test, y_test_kmeans))
calinski_harabasz_scores_test[tol].append(
calinski_harabasz_score(test, y_test_kmeans)
)
davies_bouldin_scores_test[tol].append(
davies_bouldin_score(test, y_test_kmeans)
)
fig, axs = plt.subplots(3, 2, figsize=(15, 15))
for tol in tol_values:
axs[0][0].plot(
range(2, 11), silhouette_scores[tol], marker="o", label=f"tol={tol}"
)
axs[1][0].plot(
range(2, 11), calinski_harabasz_scores[tol], marker="o", label=f"tol={tol}"
)
axs[2][0].plot(
range(2, 11), davies_bouldin_scores[tol], marker="o", label=f"tol={tol}"
)
for tol in tol_values:
axs[0][1].plot(
range(2, 11),
silhouette_scores_test[tol],
marker="o",
label=f"tol={tol} test",
)
axs[1][1].plot(
range(2, 11),
calinski_harabasz_scores_test[tol],
marker="o",
label=f"tol={tol} test",
)
axs[2][1].plot(
range(2, 11),
davies_bouldin_scores_test[tol],
marker="o",
label=f"tol={tol} test",
)
axs[0][0].set_title("Silhouette Score")
axs[0][0].set_xlabel("Number of clusters")
axs[0][0].set_ylabel("Score")
axs[0][0].legend()
axs[0][0].grid(True)
axs[1][0].set_title("Calinski Harabasz Score")
axs[1][0].set_xlabel("Number of clusters")
axs[1][0].set_ylabel("Score")
axs[1][0].legend()
axs[1][0].grid(True)
axs[2][0].set_title("Davies Bouldin Score")
axs[2][0].set_xlabel("Number of clusters")
axs[2][0].set_ylabel("Score")
axs[2][0].legend()
axs[2][0].grid(True)
axs[0][1].set_title("Silhouette Score Test")
axs[0][1].set_xlabel("Number of clusters")
axs[0][1].set_ylabel("Score")
axs[0][1].legend()
axs[0][1].grid(True)
axs[1][1].set_title("Calinski Harabasz Score Test")
axs[1][1].set_xlabel("Number of clusters")
axs[1][1].set_ylabel("Score")
axs[1][1].legend()
axs[1][1].grid(True)
axs[2][1].set_title("Davies Bouldin Score Test")
axs[2][1].set_xlabel("Number of clusters")
axs[2][1].set_ylabel("Score")
axs[2][1].legend()
axs[2][1].grid(True)
plt.tight_layout()
plt.show()
test_kmeans_tol_hyperparameter(XY_train_n, XY_test_n)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)¶
- a popular clustering algorithm that works by analysing the spatial density of the data points.
- DBSCAN has two important parameters: epsilon (ε) and min_samples.
- epsilon (ε) - the radius of the neighbourhood around a point within which we look for other points
- min_samples - the minimum number of points within the epsilon radius for a point to count as a core point
- DBSCAN distinguishes three kinds of points:
- core points - points that have at least min_samples points within their epsilon radius
- border points - points that have fewer than min_samples points within their epsilon radius but lie within the epsilon radius of a core point
- noise points - points that are neither core points nor border points
- Noise points are ignored or may form a separate "noise" cluster.
- DBSCAN copes with clusters of irregular shapes, which is a challenge for centroid-distance algorithms such as KMeans.
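The three point types can be read directly off a fitted scikit-learn model: core points are listed in `core_sample_indices_`, noise points receive the label -1, and whatever remains is a border point. A tiny hand-made example (a dense blob, one nearby point, one isolated point):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense blob
              [0.2, 0.2],                                       # near the blob
              [5.0, 5.0]])                                      # isolated

db = DBSCAN(eps=0.2, min_samples=3).fit(X)
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True    # core: >= min_samples neighbours (incl. itself)
noise = db.labels_ == -1                # noise: not reachable from any core point
border = ~core & ~noise                 # border: assigned to a cluster but not core
print("core:", np.where(core)[0])       # -> [0 1 2 3]
print("border:", np.where(border)[0])   # -> [4]
print("noise:", np.where(noise)[0])     # -> [5]
```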
How do we choose values for the eps and min_samples parameters?
- eps - analogously to the elbow method, but applied to the distances between nearest neighbours: sort the distances and pick the inflection point.
- min_samples - depends on the dimensionality of the data: >= 2 * number of dimensions, or 1 + number of dimensions.
In [ ]:
neighbours = NearestNeighbors(n_neighbors=10)
neighbours.fit(XY_train_n)
distances, indices = neighbours.kneighbors(XY_train_n)
# Sort distances and take the second column (nearest-neighbour distance)
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
plt.ylim(0, 0.2)
plt.xlim(2500, 3500)
plt.yticks(np.arange(0, max(distances), 0.1))
plt.grid(True)
plt.title("K-distance Graph")
plt.xlabel("Data Points")
plt.ylabel("Epsilon")
plt.show()
In [ ]:
neighbours = NearestNeighbors(n_neighbors=10)
neighbours.fit(XY_train_n)
distances, indices = neighbours.kneighbors(XY_train_n)
# Sort distances and take the second column (nearest neighbor distance)
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
# Plot with switched axes
plt.plot(distances, np.arange(len(distances)))
plt.xlim(0, 0.2)
plt.ylim(2500, 3500)
plt.xticks(np.arange(0, max(distances), 0.1))
plt.grid(True)
plt.title("K-distance Graph")
plt.ylabel("Data Points")
plt.xlabel("Epsilon")
plt.show()
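Reading the elbow off the plots above by eye can be automated with a simple heuristic: on the sorted k-distance curve, take the point with the largest perpendicular distance to the chord joining the curve's endpoints. This is only a sketch on a synthetic curve (in the notebook it would be applied to `distances`); in practice both axes may need rescaling first, since the result depends on their relative units:

```python
import numpy as np

def knee_point(values):
    """Index of the elbow of a sorted 1-D curve (max distance to the chord)."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Unit vector along the chord from the first to the last point
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord = chord / np.linalg.norm(chord)
    # Perpendicular distance of every point to that chord
    vecs = np.column_stack([x - x[0], y - y[0]])
    proj = vecs @ chord
    perp = vecs - np.outer(proj, chord)
    return int(np.argmax(np.linalg.norm(perp, axis=1)))

# Synthetic k-distance curve: a flat stretch, then a sharp rise
curve = np.concatenate([np.linspace(0.01, 0.05, 80), np.linspace(0.05, 1.0, 20)])
i = knee_point(curve)
print(f"elbow at index {i}, suggested eps ~ {curve[i]:.3f}")
```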
In [ ]:
from sklearn import metrics
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.06, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_train_n)
# Reduce dimensions to 3 using PCA for visualization
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_n)
# Create a 3D plot
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection="3d")
# Scatter plot
scatter = ax.scatter(
X_train_pca[:, 0],
X_train_pca[:, 1],
X_train_pca[:, 2],
c=dbscan_labels,
cmap="viridis",
alpha=0.7,
)
# Add legend and labels
legend = ax.legend(*scatter.legend_elements(), title="Clusters")
ax.add_artist(legend)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("DBSCAN Clustering with PCA")
# Show plot
plt.show()
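Quality metrics are less obvious for DBSCAN than for KMeans, because the noise label -1 is not a real cluster. One common (though not the only) convention is to drop the noise points before computing the silhouette. A sketch on synthetic blobs rather than `X_train_n`:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

mask = labels != -1                        # drop noise points (label -1)
n_clusters = len(set(labels[mask]))
print("clusters:", n_clusters, "| noise points:", int((~mask).sum()))
if n_clusters > 1:                         # silhouette needs >= 2 clusters
    print("silhouette (noise excluded):",
          round(silhouette_score(X[mask], labels[mask]), 3))
```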
In [ ]:
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.1, min_samples=10)
dbscan_labels = dbscan.fit_predict(XY_train_n)
# Reduce dimensions to 3 using PCA for visualization
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(XY_train_n)
# Create a DataFrame for Plotly
df = pd.DataFrame(X_train_pca, columns=["PC1", "PC2", "PC3"])
df["Cluster"] = dbscan_labels
# Create an interactive 3D plot using Plotly
fig = px.scatter_3d(
df,
x="PC1",
y="PC2",
z="PC3",
color="Cluster",
title="DBSCAN Clustering with PCA",
labels={
"PC1": "Principal Component 1",
"PC2": "Principal Component 2",
"PC3": "Principal Component 3",
},
width=800,
height=800,
opacity=0.7,
)
# Show plot
fig.show()
In [ ]:
# use pca to do dimension reduction
pca = PCA(n_components=3, random_state=random)
principalComponents = pca.fit_transform(X_train_n)
# Create a DataFrame with the principal components
principalDf3 = pd.DataFrame(data=principalComponents, columns=['pc1', 'pc2', 'pc3'])
# DBSCAN clustering
db_pca = DBSCAN(eps=0.1, min_samples=10)
db_pca.fit(X_train_n)
y_db_pca = db_pca.labels_
# Create a DataFrame with the cluster labels
df_clustered_pca = principalDf3.copy()
df_clustered_pca['Cluster'] = y_db_pca
fig = px.scatter_3d(
df_clustered_pca,
x="pc1",
y="pc2",
z="pc3",
color=y_db_pca,
opacity=0.7,
width=1000,
height=800,
)
# add fig title
fig.update_layout(title="DBSCAN Clustering with PCA")
fig.show()
Metrics¶
In [ ]:
def animate_hiperparameters(dataframe, test, eps_values):
# Perform PCA to reduce to 3 dimensions for visualization
pca = PCA(n_components=3)
principalDf3 = pd.DataFrame(
pca.fit_transform(dataframe), columns=["PC1", "PC2", "PC3"]
)
pca_test = pd.DataFrame(pca.transform(test), columns=["PC1", "PC2", "PC3"])
# Initialize lists to store metrics
silhouette_scores_DBSCAN = []
calinski_harabasz_scores_DBSCAN = []
davies_bouldin_scores_DBSCAN = []
silhouette_scores_DBSCAN_test = []
calinski_harabasz_scores_DBSCAN_test = []
davies_bouldin_scores_DBSCAN_test = []
for eps in eps_values:
dbscan = DBSCAN(eps=eps, min_samples=10)
dbscan.fit(dataframe)
y_dbscan = dbscan.labels_
# DBSCAN has no predict(); the test set is clustered by refitting on it
y_test_dbscan = dbscan.fit_predict(test)
if len(set(y_dbscan)) > 1:
silhouette_scores_DBSCAN.append(silhouette_score(dataframe, y_dbscan))
calinski_harabasz_scores_DBSCAN.append(
calinski_harabasz_score(dataframe, y_dbscan)
)
davies_bouldin_scores_DBSCAN.append(
davies_bouldin_score(dataframe, y_dbscan)
)
else:
silhouette_scores_DBSCAN.append(-1)
calinski_harabasz_scores_DBSCAN.append(-1)
davies_bouldin_scores_DBSCAN.append(-1)
if len(set(y_test_dbscan)) > 1:
silhouette_scores_DBSCAN_test.append(silhouette_score(test, y_test_dbscan))
calinski_harabasz_scores_DBSCAN_test.append(
calinski_harabasz_score(test, y_test_dbscan)
)
davies_bouldin_scores_DBSCAN_test.append(
davies_bouldin_score(test, y_test_dbscan)
)
else:
silhouette_scores_DBSCAN_test.append(-1)
calinski_harabasz_scores_DBSCAN_test.append(-1)
davies_bouldin_scores_DBSCAN_test.append(-1)
fig = plt.figure(figsize=(20, 10))
ax1 = fig.add_subplot(231, projection="3d")
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_zlabel("PC3")
ax1.set_title("Train Clustering with PCA")
ax2 = fig.add_subplot(234, projection="3d")
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC2")
ax2.set_zlabel("PC3")
ax2.set_title("Test Clustering with PCA")
ax3 = fig.add_subplot(232)
ax4 = fig.add_subplot(233)
ax5 = fig.add_subplot(235)
def update(frame):
eps = eps_values[frame]
ax1.cla()
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_zlabel("PC3")
ax1.set_title(f"Train Clustering with DBSCAN - eps={eps:.2f}")
ax2.cla()
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC2")
ax2.set_zlabel("PC3")
ax2.set_title(f"Test Clustering with DBSCAN - eps={eps:.2f}")
dbscan_pca = DBSCAN(eps=eps, min_samples=5)
y_dbscan_pca = dbscan_pca.fit_predict(principalDf3)
y_dbscan_pca_test = dbscan_pca.fit_predict(pca_test)
df_clustered_pca = principalDf3.copy()
df_clustered_pca["Cluster"] = y_dbscan_pca
df_clustered_pca_test = pca_test.copy()
df_clustered_pca_test["Cluster"] = y_dbscan_pca_test
scatter1 = ax1.scatter(
df_clustered_pca["PC1"],
df_clustered_pca["PC2"],
df_clustered_pca["PC3"],
c=df_clustered_pca["Cluster"],
cmap="viridis",
alpha=0.7,
)
legend1 = ax1.legend(*scatter1.legend_elements(), title="Clusters")
ax1.add_artist(legend1)
scatter2 = ax2.scatter(
df_clustered_pca_test["PC1"],
df_clustered_pca_test["PC2"],
df_clustered_pca_test["PC3"],
c=df_clustered_pca_test["Cluster"],
cmap="viridis",
alpha=0.7,
)
legend2 = ax2.legend(*scatter2.legend_elements(), title="Clusters")
ax2.add_artist(legend2)
ax3.cla()
ax4.cla()
ax5.cla()
ax3.plot(
eps_values, silhouette_scores_DBSCAN, marker="o", color="b", label="Train"
)
ax3.plot(
eps_values,
silhouette_scores_DBSCAN_test,
marker="o",
color="k",
label="Test",
)
ax3.set_title("Silhouette Score")
ax3.set_xlabel("Epsilon")
ax3.set_ylabel("Score")
ax3.grid(True)
ax3.axvline(x=eps, color="r", linestyle="-.")
ax3.legend()
ax4.plot(
eps_values,
calinski_harabasz_scores_DBSCAN,
marker="o",
color="r",
label="Train",
)
ax4.plot(
eps_values,
calinski_harabasz_scores_DBSCAN_test,
marker="o",
color="k",
label="Test",
)
ax4.set_title("Calinski Harabasz Score")
ax4.set_xlabel("Epsilon")
ax4.set_ylabel("Score")
ax4.grid(True)
ax4.axvline(x=eps, color="r", linestyle="-.")
ax4.legend()
ax5.plot(
eps_values,
davies_bouldin_scores_DBSCAN,
marker="o",
color="g",
label="Train",
)
ax5.plot(
eps_values,
davies_bouldin_scores_DBSCAN_test,
marker="o",
color="k",
label="Test",
)
ax5.set_title("Davies Bouldin Score")
ax5.set_xlabel("Epsilon")
ax5.set_ylabel("Score")
ax5.grid(True)
ax5.axvline(x=eps, color="r", linestyle="-.")
ax5.legend()
ani = animation.FuncAnimation(
fig, update, frames=range(len(eps_values)), repeat=True
)
ani.save("dbscan_pca.gif", writer="imagemagick", fps=1)
eps_values = np.linspace(0.01, 2.0, 10) # Epsilon values range
animate_hiperparameters(XY_train_n, XY_test_n, eps_values)
from IPython.display import Image
Image(filename="dbscan_pca.gif")
MovieWriter imagemagick unavailable; using Pillow instead.
Out[ ]:
<IPython.core.display.Image object>
In [ ]:
eps_values = np.linspace(0.01, 0.2, 10) # Epsilon values range
animate_hiperparameters(XY_train_n, XY_test_n, eps_values)
from IPython.display import Image
Image(filename="dbscan_pca.gif")
MovieWriter imagemagick unavailable; using Pillow instead.
Out[ ]:
<IPython.core.display.Image object>
In [ ]:
def animate_dbscan_hiperparameters_minsamples(dataframe, test, eps, min_samples_values):
pca = PCA(n_components=3)
principalDf3 = pd.DataFrame(
pca.fit_transform(dataframe), columns=["PC1", "PC2", "PC3"]
)
pca_test = pd.DataFrame(pca.transform(test), columns=["PC1", "PC2", "PC3"])
silhouette_scores_DBSCAN = []
calinski_harabasz_scores_DBSCAN = []
davies_bouldin_scores_DBSCAN = []
silhouette_scores_DBSCAN_test = []
calinski_harabasz_scores_DBSCAN_test = []
davies_bouldin_scores_DBSCAN_test = []
for min_samples in min_samples_values:
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
dbscan.fit(dataframe)
y_dbscan = dbscan.labels_
# DBSCAN has no predict(); the test set is clustered by refitting on it
y_test_dbscan = dbscan.fit_predict(test)
if len(set(y_dbscan)) > 1:
silhouette_scores_DBSCAN.append(silhouette_score(dataframe, y_dbscan))
calinski_harabasz_scores_DBSCAN.append(
calinski_harabasz_score(dataframe, y_dbscan)
)
davies_bouldin_scores_DBSCAN.append(
davies_bouldin_score(dataframe, y_dbscan)
)
else:
silhouette_scores_DBSCAN.append(-1)
calinski_harabasz_scores_DBSCAN.append(-1)
davies_bouldin_scores_DBSCAN.append(-1)
if len(set(y_test_dbscan)) > 1:
silhouette_scores_DBSCAN_test.append(silhouette_score(test, y_test_dbscan))
calinski_harabasz_scores_DBSCAN_test.append(
calinski_harabasz_score(test, y_test_dbscan)
)
davies_bouldin_scores_DBSCAN_test.append(
davies_bouldin_score(test, y_test_dbscan)
)
else:
silhouette_scores_DBSCAN_test.append(-1)
calinski_harabasz_scores_DBSCAN_test.append(-1)
davies_bouldin_scores_DBSCAN_test.append(-1)
fig = plt.figure(figsize=(20, 10))
ax1 = fig.add_subplot(231, projection="3d")
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_zlabel("PC3")
ax1.set_title("Train Clustering with PCA")
ax2 = fig.add_subplot(234, projection="3d")
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC2")
ax2.set_zlabel("PC3")
ax2.set_title("Test Clustering with PCA")
ax3 = fig.add_subplot(232)
ax4 = fig.add_subplot(233)
ax5 = fig.add_subplot(235)
def update(frame):
min_samples = min_samples_values[frame]
ax1.cla()
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_zlabel("PC3")
ax1.set_title(
f"Train Clustering with DBSCAN - eps={eps:.2f} - min_samples={min_samples}"
)
ax2.cla()
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC2")
ax2.set_zlabel("PC3")
ax2.set_title(
f"Test Clustering with DBSCAN - eps={eps:.2f} - min_samples={min_samples}"
)
dbscan_pca = DBSCAN(eps=eps, min_samples=min_samples)
y_dbscan_pca = dbscan_pca.fit_predict(principalDf3)
y_dbscan_pca_test = dbscan_pca.fit_predict(pca_test)
df_clustered_pca = principalDf3.copy()
df_clustered_pca["Cluster"] = y_dbscan_pca
df_clustered_pca_test = pca_test.copy()
df_clustered_pca_test["Cluster"] = y_dbscan_pca_test
scatter1 = ax1.scatter(
df_clustered_pca["PC1"],
df_clustered_pca["PC2"],
df_clustered_pca["PC3"],
c=df_clustered_pca["Cluster"],
cmap="viridis",
alpha=0.7,
)
legend1 = ax1.legend(*scatter1.legend_elements(), title="Clusters")
ax1.add_artist(legend1)
scatter2 = ax2.scatter(
df_clustered_pca_test["PC1"],
df_clustered_pca_test["PC2"],
df_clustered_pca_test["PC3"],
c=df_clustered_pca_test["Cluster"],
cmap="viridis",
alpha=0.7,
)
legend2 = ax2.legend(*scatter2.legend_elements(), title="Clusters")
ax2.add_artist(legend2)
ax3.cla()
ax4.cla()
ax5.cla()
ax3.plot(
min_samples_values,
silhouette_scores_DBSCAN,
marker="o",
color="b",
label="Train",
)
ax3.plot(
min_samples_values,
silhouette_scores_DBSCAN_test,
marker="o",
color="k",
label="Test",
)
ax3.set_title("Silhouette Score")
ax3.set_xlabel("Min Samples")
ax3.set_ylabel("Score")
ax3.grid(True)
ax3.axvline(x=min_samples, color="r", linestyle="-.")
ax3.legend()
ax4.plot(
min_samples_values,
calinski_harabasz_scores_DBSCAN,
marker="o",
color="r",
label="Train",
)
ax4.plot(
min_samples_values,
calinski_harabasz_scores_DBSCAN_test,
marker="o",
color="k",
label="Test",
)
ax4.set_title("Calinski Harabasz Score")
ax4.set_xlabel("Min Samples")
ax4.set_ylabel("Score")
ax4.grid(True)
ax4.axvline(x=min_samples, color="r", linestyle="-.")
ax4.legend()
ax5.plot(
min_samples_values,
davies_bouldin_scores_DBSCAN,
marker="o",
color="g",
label="Train",
)
ax5.plot(
min_samples_values,
davies_bouldin_scores_DBSCAN_test,
marker="o",
color="k",
label="Test",
)
ax5.set_title("Davies Bouldin Score")
ax5.set_xlabel("Min Samples")
ax5.set_ylabel("Score")
ax5.grid(True)
ax5.axvline(x=min_samples, color="r", linestyle="-.")
ax5.legend()
ani = animation.FuncAnimation(
fig, update, frames=range(len(min_samples_values)), repeat=True
)
ani.save("dbscan_min_samples.gif", writer="pillow", fps=1)
min_samples_values = range(1, 10)
eps = 0.1
animate_dbscan_hiperparameters_minsamples(
XY_train_n, XY_test_n, eps, min_samples_values
)
Image(filename="dbscan_min_samples.gif")
MovieWriter Pillow unavailable; using Pillow instead.
Out[ ]:
<IPython.core.display.Image object>
Helper Questions
- Is normalization/standardization of the data needed for clustering?
- How do the two algorithms differ in how they represent a cluster?
- Which of the algorithms is less robust to noise and outliers? Why?
- Should we use cross-validation in a clustering task?
- Should the results of the studied clustering algorithms be repeated and averaged?
- What do the clustering metrics given in the task description measure?
- How do the algorithms behave for extreme numbers of clusters?
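As a hint for the first question, a small synthetic experiment (not the notebook data): two groups that differ only in the first feature, plus a second feature that is pure noise on a much larger scale. Without standardization, the Euclidean distances KMeans relies on are dominated by the noisy feature:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
f1 = np.concatenate([rng.normal(0, 0.1, 200), rng.normal(3, 0.1, 200)])  # two groups
f2 = rng.normal(0, 500.0, 400)                # large-scale noise, no group structure
X = np.column_stack([f1, f2])
true = np.array([0] * 200 + [1] * 200)

for name, data in [("raw", X), ("standardized", StandardScaler().fit_transform(X))]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    # Agreement with the true grouping, up to label permutation
    acc = max(np.mean(labels == true), np.mean(labels != true))
    print(f"{name:12s} agreement with true groups: {acc:.2f}")
```

On the raw data the split follows the noisy feature and agreement stays near chance level; after standardization the split recovers the true groups.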